XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents
Authors
Abstract
Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose a combination of both XML and knowledge technologies can be used. XML distinguishes between data, its structure and semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationship. Moreover, the combination with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, an effective transformation of paper documents into XML format is a complex process involving several steps. In this paper we propose the application of knowledge technologies to many document processing steps, namely rule-based systems for semantic indexing of documents and the extraction of the necessary knowledge by means of machine learning techniques. This approach has been implemented in the system Wisdom++, which is currently used in the European project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material) to provide film archivists with a tool for the automated annotation of historical documents in film archives.
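As a rough illustration of the idea in the abstract, the sketch below stores layout blocks from a scanned page as XML elements that carry both their geometry and a logical label assigned by simple rules, then builds an index keyed by those labels. Everything here is an assumption made for illustration: the element and attribute names, the sample blocks, and the hand-written rules are hypothetical and do not reflect the Wisdom++ format or rule base. In the approach the abstract describes, the rules are extracted by machine learning rather than written by hand.

```python
import xml.etree.ElementTree as ET

PAGE_HEIGHT = 800  # hypothetical page height in pixels

# Hypothetical layout blocks: (id, x, y, width, height, recognized text).
# Placeholder data only, not output of any real system.
BLOCKS = [
    ("b1", 40, 30, 520, 40, "Sample Document Title"),
    ("b2", 40, 90, 520, 300, "Sample body text of the document."),
    ("b3", 40, 760, 200, 20, "Sample place and date"),
]


def label_block(y, text):
    """Toy hand-written rules mapping a layout block to a logical label."""
    if y < 0.1 * PAGE_HEIGHT and len(text) < 60:
        return "title"
    if y > 0.9 * PAGE_HEIGHT:
        return "place_and_date"
    return "body"


def page_to_xml(blocks):
    """Build an XML tree keeping content, layout, and label side by side."""
    page = ET.Element("page", height=str(PAGE_HEIGHT))
    for block_id, x, y, w, h, text in blocks:
        el = ET.SubElement(
            page, "block",
            id=block_id, x=str(x), y=str(y), width=str(w), height=str(h),
            label=label_block(y, text),
        )
        el.text = text
    return page


def index_by_label(page):
    """Group block texts by logical label, a minimal semantic index."""
    index = {}
    for el in page.iter("block"):
        index.setdefault(el.get("label"), []).append(el.text)
    return index


page = page_to_xml(BLOCKS)
index = index_by_label(page)
print(ET.tostring(page, encoding="unicode"))
print(index)
```

Because the geometry stays in the XML alongside the labels, a stylesheet could in principle reproduce the original layout, while queries can target the labels alone.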
Similar resources
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
Given the growing number of XML documents, organizing them effectively so that useful information can be retrieved from them is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
Retrieving Video Segments Based on Combined Text, Speech and Image Processing
This paper describes a multimedia, multilingual and multimodal research system (CIMWOS) supporting content-based indexing, archiving, retrieval and on-demand delivery of audiovisual content. Several projects aim to develop advanced technologies and systems to tackle the problems encountered in multimedia archiving and indexing [8], [9], [10]. CIMWOS [1] (Combined IMage and WOrd ...
Embedding Knowledge in Web Documents: CGs versus XML-based Metadata Languages
The paper argues for the use of general and intuitive knowledge representation languages for indexing the content of Web documents and representing knowledge within them. We believe these languages have advantages over metadata languages based on the Extensible Markup Language (XML). Indeed, the representation and retrieval of precise information is better supported by languages designed to rep...
A Bayesian Approach to WSD for the Retrieval of XML Documents
Sources of XML documents are proliferating on the World Wide Web. An important feature of XML is that information on document structure is available on the Web together with the document contents. This information can be exploited to improve document handling and query processing. In such a heterogeneous environment as the Web, it is not reasonable to assume that there are ...
Mapping XML to Existing OWL Ontologies
XML has gained wide recognition and has brought interoperability at the syntactic level. Unfortunately, even when XML is used to represent data, problems arise when different data sources must be integrated, because XML lacks support for efficient sharing of conceptualizations. Emerging Semantic Web technologies, such as ontologies, can enable semantic interoperability. With onto...